Project Introduction

On January 12, 2010, a magnitude 7.0 earthquake struck Haiti causing significant damage which affected approximately 3 million citizens. In the wake of the disaster, aid groups were working to locate displaced persons and provide them with food and water. However, due to the large scale destruction of infrastructure over a wide area additional assistance was needed to locate people quickly.

Little is left of a neighborhood on a hillside near downtown Port-au-Prince on Jan. 15. More than a million people were displaced by the quake. (David Gilkey/NPR)

Displaced persons were known to be making make-shift shelters out of blue tarps. High resolution geo-refereneced images were captured by aircraft of the destroyed areas. The data generated by the image collection was too large for aid workers to process in time to supply aid. Therefore, a team from the Rochester Institute of Technology used data-mining algorithms to analyze the images and identify blue tarps. The goal was to effectively locate displaced persons and communicate their location to rescue workers so they could get resources to people who needed it in time.

Sample image of a geo-referenced image used for the analysis

As the final project for SYS 6018 - Data Mining, we were assigned to build models from the different techniques we learned in the course to, as accurately as possible, and in as timely a manner as possible, locate as many of the displaced persons identified in the imagery data so that they could be provided food and water before their situations became unsurvivable. The data made available to students consisted of a csv of red, green, blue pixel values and a class indicator which indicated if a pixel was representative of a blue tarp or something else like vegetation. A final hold-out data set presented in the format of multiple text files was provided as well.

Project Budget

The US Government spent $1.5B on Haiti disaster relief by the end of 2010. For this project, we will assume that 5 million dollars were allocated to our team to deliver supplies to displaced individuals in the immediate aftermath. Our team has been assigned an area where 8,000 displaced people are expected to be. Anything less than a 85% delivery success rate will be considered a disaster relief failure. 90% of 8,000 people is 7,200.

Budget $5,000,000
Cost per Delivery (True Positive) -$1,000
Cost per Mis-Delivery (False Positive) -$500
Cost per Missed Delivery (False Negative) -$6,250
Cost per True Negative $0
  • method of delivery
  • cost of resources
  • time from disaster to unsurvivable conditions

Data Exploration

tic("Knit Time")

The data provided for analysis was generated from overhead images and stored as a three channel output. Each pixel also had a classifier label indicating whether it was a blue tarp or something else like vegetation or soil. The channels represented the red, green, and blue values for pixels within images. RGB color model is referred to as an additive model. The integer value for the red, green, and blue channels are combined to represent a color. Typically, the component values are stored as an 8 bit integer ranging from 0 to 255.

Check NA

df <- tibble(read.csv("HaitiPixels.csv")) #read in df
"Check for NA values" 
anyNA(df) #check for NA values 
"Summary of Data"
summary(df) #quick look at data
df$Class <- factor(df$Class) #make Class a factor variable. 
#> [1] "Check for NA values"
#> [1] FALSE
#> [1] "Summary of Data"
#>     Class                Red          Green            Blue      
#>  Length:63241       Min.   : 48   Min.   : 48.0   Min.   : 44.0  
#>  Class :character   1st Qu.: 80   1st Qu.: 78.0   1st Qu.: 63.0  
#>  Mode  :character   Median :163   Median :148.0   Median :123.0  
#>                     Mean   :163   Mean   :153.7   Mean   :125.1  
#>                     3rd Qu.:255   3rd Qu.:226.0   3rd Qu.:181.0  
#>                     Max.   :255   Max.   :255.0   Max.   :255.0

Scatter and Correlation

#Reference [1]
# The palette with grey:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",  "#CC79A7", "#0072B2", "#D55E00")
#"#F0E442",

#view scatter and correlations  
p <- ggpairs(df[,2:4], lower.panel = NULL, upper = list(continuous = wrap("cor", size = 3)), aes(color=df$Class)) #+ scale_fill_manual(values=cbPalette) 

#Reference: https://stackoverflow.com/questions/34740210/how-to-change-the-color-palette-for-ggallyggpairs/34743555
for(i in 1:p$nrow) {
  for(j in 1:p$ncol){
    p[i,j] <- p[i,j] + 
        scale_fill_manual(values=cbPalette) +
        scale_color_manual(values=cbPalette)  
  }
}

p

attach(df) #attach df variables 

3D Scatter

fig <- plot_ly(df, x=~Red, y=~Green, z=~Blue, color=~Class, colors=c("#999999", "#E69F00", "#56B4E9", "#009E73",  "#CC79A7")) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene=list(xaxis=list(title="Red"),
                     yaxis = list(title = 'Green'),
                     zaxis = list(title = 'Blue')))

fig

!!!!!!!!!! IF I HAVE TIME MAKE A SELECTOR TO CHOOSE COLOR SCHEME FOR NOT COLOR BLIND OR DIFFERENT KINDS OF COLOR BLIND https://socviz.co/refineplots.html

Initial inspection of the data frame indicated no missing values. The data provided is sufficiently cleaned only one further adjustment to the data frame is needed. Since our main interest is to predict whether a pixel represents a blue tarp or not a blue tarp, the Class column of the data frame needs to be converted into a binary indicator for blue tarp or not blue tarp. This is done in the next section.

The data was visualized with the ggpairs function. For a pair of variables chosen from the data frame, Ggpairs generates a scatterplot, displays a Pearson correlation, and, on the diagonal, shows a variable distribution. The plots were also color-coded by class. The class label describes what kind of object a pixel is associated with. In our data frame there were the following classes: Blue Tarp, Rooftop, Soil, Various Non-tarp, and Vegetation. The 2D representation of the data only gives us a partial insight into the behavior and relationships of the predictors. Since three channels are used to generate a color, plotting the data in 3D to investigate trends and behavior between classes will be more meaningful.

The 3D scatter plot shows a significant amount of overlap between the different classes. It is worth noting that it is possible to see some separation between the classes.

Prepare Data Frame for Analysis


Data Frame

df <- cbind(mutate(df, "Blue_Tarp_or_Not"=ifelse(Class != "Blue Tarp", 0, 1))) #add binary column indicating whether the Class variable is "Blue Tarp" or not
attach(df)
df$Blue_Tarp_or_Not <- factor(Blue_Tarp_or_Not, labels = c("NBT", "BT"))#, levels =c(0,1), labels = c("NBT", "BT")) #ensure new column is a factor 
"First Six Rows of Data Frame"
head(df)
df_factor  <- df[, -1]
"Last Six Rows of Data Frame"
tail(df_factor)
attach(df_factor)
#> [1] "First Six Rows of Data Frame"
#>        Class Red Green Blue Blue_Tarp_or_Not
#> 1 Vegetation  64    67   50              NBT
#> 2 Vegetation  64    67   50              NBT
#> 3 Vegetation  64    66   49              NBT
#> 4 Vegetation  75    82   53              NBT
#> 5 Vegetation  74    82   54              NBT
#> 6 Vegetation  72    76   52              NBT
#> [1] "Last Six Rows of Data Frame"
#>       Red Green Blue Blue_Tarp_or_Not
#> 63236 136   145  150               BT
#> 63237 138   146  150               BT
#> 63238 134   141  152               BT
#> 63239 136   143  151               BT
#> 63240 132   139  149               BT
#> 63241 133   141  153               BT

3D Scatter - Binary

fig1 <- plot_ly(df_factor, x=~Red, y=~Green, z=~Blue, color=~Blue_Tarp_or_Not, colors = cbPalette) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig1 <- fig1 %>% add_markers()
fig1 <- fig1 %>% layout(scene=list(xaxis=list(title="Red"),
                     yaxis = list(title = 'Green'),
                     zaxis = list(title = 'Blue')))

fig1

After the class label is converted into a binary classifier, it is easier to see separation between the data points for blue tarps and not blue tarps.

Model Fitting

Logistic Regression

Fit a Logistic Regression Model ….

Model

#pass
fitControl <- trainControl(method = "cv",
                           number = 10,
                           returnResamp = 'all',
                           savePredictions = 'final',
                           classProbs = TRUE) 

set.seed(4)
tic("Log. Reg.")
glm.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
                    data = df_factor,
                    method="glm",
                    family="binomial",
                    trControl= fitControl)
glm.time <- toc(quiet = TRUE)
glm.fit

"Summary"
summary(glm.fit)
#> Generalized Linear Model 
#> 
#> 63241 samples
#>     3 predictor
#>     2 classes: 'NBT', 'BT' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ... 
#> Resampling results:
#> 
#>   Accuracy   Kappa    
#>   0.9952879  0.9207043
#> 
#> [1] "Summary"
#> 
#> Call:
#> NULL
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -3.4266  -0.0205  -0.0015   0.0000   3.7911  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  0.20984    0.18455   1.137    0.256    
#> Red         -0.26031    0.01262 -20.632   <2e-16 ***
#> Green       -0.21831    0.01330 -16.416   <2e-16 ***
#> Blue         0.47241    0.01548  30.511   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 17901.6  on 63240  degrees of freedom
#> Residual deviance:  1769.5  on 63237  degrees of freedom
#> AIC: 1777.5
#> 
#> Number of Fisher Scoring iterations: 12

ROC

Test model performance on Train data to select threshold values…

#pass
glm.prob <- predict(glm.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
glm_roc <- roc(df_factor $Blue_Tarp_or_Not, glm.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="Log. Reg. ROC Curve") 

Thresholds

roc.info_glm <- roc(df_factor$Blue_Tarp_or_Not, glm.prob[,2], legacy.axes=TRUE)
roc.glm.df <- data.frame(tpp=roc.info_glm$sensitivities*100, fpp=(1-roc.info_glm$specificities)*100, thresholds=roc.info_glm$thresholds)
#roc.glm.df[roc.glm.df>98.5 & roc.glm.df < 99,]

glm.thresholds <- data.matrix(roc.glm.df$thresholds)

fig2 <- plot_ly(roc.glm.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig2 <- fig2 %>% add_markers()
fig2 <- fig2 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
                     yaxis = list(title = 'False Positive Rate'),
                     zaxis = list(title = 'Threshold')))

fig2
Confusion_Matrix <- set_names(seq(0.01,0.99,0.01)) %>% 
  map( ~ Blue_Tarp_or_Not(factor(ifelse(synth_dat$p>=.x,'T','F')), factor(synth_dat$y)))

Cost_Benefit_Matrix <- matrix(c(0,-5000,-300,-1000), nrow=2, ncol=2, byrow=TRUE)
  
Cost <- Confusion_Matrix %>%
  map( ~ sum(.x * Cost_Benefit_Matrix))

Cost_actual <- Confusion_Matrix %>%
  map( ~ sum((.x*Cost_Benefit_Matrix)[2,]))

tibble(threshold=as.numeric(names(Cost)), 
       cost=unlist(Cost), cost_actual=unlist(Cost_actual)) %>%
  ggplot() + geom_line(aes(x=threshold, y=cost, color='perceived'), size=3) +
  geom_line(aes(x=threshold, y=cost_actual, color='actual'), 
            size=2, linetype='dotted') +
  ylim(c(-4000000,0)) + scale_fill_manual(values=c('forestgreen','purple')) +
  geom_abline(slope=0, intercept=-2000000)

Confusion Matrix

lr.thresh <- 0.003
glm.pred_thresh <- factor(ifelse(glm.prob[,2]>lr.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.glm_thresh <- confusionMatrix(factor(glm.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT") 
"Threshold:"
lr.thresh
cm.glm_thresh

acc_LR <- cm.glm_thresh[["overall"]][["Accuracy"]]*100
auc_LR <- glm_roc[["auc"]]
thresh_LR <- lr.thresh
sens_LR <-  cm.glm_thresh[["byClass"]][["Sensitivity"]]*100
spec_LR <- cm.glm_thresh[["byClass"]][["Specificity"]]*100
FDR_LR <- ((cm.glm_thresh[["table"]][2,1])/(cm.glm_thresh[["table"]][2,1]+cm.glm_thresh[["table"]][2,2]))*100
prec_LR <- cm.glm_thresh[["byClass"]][["Precision"]]*100

cost.glm <- cm.glm_thresh[["table"]][1]*0 + cm.glm_thresh[["table"]][2]*FP_C + cm.glm_thresh[["table"]][3]*FN_C + cm.glm_thresh[["table"]][4]*TP_C
cost.glm
#> [1] "Threshold:"
#> [1] 0.003
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   NBT    BT
#>        NBT 55346     4
#>        BT   5873  2018
#>                                           
#>                Accuracy : 0.9071          
#>                  95% CI : (0.9048, 0.9093)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.3753          
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 0.99802         
#>             Specificity : 0.90407         
#>          Pos Pred Value : 0.25573         
#>          Neg Pred Value : 0.99993         
#>              Prevalence : 0.03197         
#>          Detection Rate : 0.03191         
#>    Detection Prevalence : 0.12478         
#>       Balanced Accuracy : 0.95104         
#>                                           
#>        'Positive' Class : BT              
#>                                           
#> [1] -4979500

Sampling Variability

"10 Fold Results"
glm.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"... 
glm.sd <- sd(glm.fit[["resample"]][["Accuracy"]]*100)
#plot(glm.fit[["resample"]][["Accuracy"]], main="Accuracy per Fold", xlab= "Fold Number", ylab="Accuracy")
#> [1] "10 Fold Results"
#>     Accuracy     Kappa parameter Resample
#> 1  0.9955724 0.9259239      none   Fold01
#> 2  0.9963631 0.9393021      none   Fold02
#> 3  0.9950980 0.9169589      none   Fold03
#> 4  0.9943083 0.9038085      none   Fold04
#> 5  0.9950980 0.9177836      none   Fold05
#> 6  0.9960468 0.9340241      none   Fold06
#> 7  0.9944655 0.9052940      none   Fold07
#> 8  0.9955724 0.9262890      none   Fold08
#> 9  0.9954143 0.9223164      none   Fold09
#> 10 0.9949399 0.9153427      none   Fold10

The average accuracy across ten folds is 90.71 with a standard deviation of 0.064.

LDA

Model

#pass
fitControl <- trainControl(method = "cv",
                           number = 10,
                           returnResamp = 'all',
                           savePredictions = 'final',
                           classProbs = TRUE)

set.seed(4)
tic("LDA")
lda.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
                    data = df_factor,
                    preProcess=c("center","scale"),
                    method="lda",
                    verbose= FALSE,
                    trControl= fitControl)
lda.time <- toc(quiet = TRUE)
lda.fit
#> Linear Discriminant Analysis 
#> 
#> 63241 samples
#>     3 predictor
#>     2 classes: 'NBT', 'BT' 
#> 
#> Pre-processing: centered (3), scaled (3) 
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ... 
#> Resampling results:
#> 
#>   Accuracy  Kappa    
#>   0.983887  0.7524541

ROC

#pass
lda.prob <- predict(lda.fit, newdata=df_factor, type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
lda_roc <- roc(df_factor$Blue_Tarp_or_Not, lda.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="LDA ROC Curve") 

Thresholds

roc.info_lda <- roc(df_factor $Blue_Tarp_or_Not, lda.prob[,2], legacy.axes=TRUE)
roc.lda.df <- data.frame(tpp=roc.info_lda$sensitivities*100, fpp=(1-roc.info_lda$specificities)*100, thresholds=roc.info_lda$thresholds)
#roc.lda.df[roc.lda.df>91.5 & roc.lda.df < 91.6,]

fig3 <- plot_ly(roc.lda.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig3 <- fig3 %>% add_markers()
fig3 <- fig3 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
                     yaxis = list(title = 'False Positive Rate'),
                     zaxis = list(title = 'Threshold')))

fig3

Confusion Matrix

lda.thresh <- 0.0008

lda.pred_thresh <- factor(ifelse(lda.prob[,2]>lda.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.lda_thresh <- confusionMatrix(factor(lda.pred_thresh),df_factor$Blue_Tarp_or_Not, positive = "BT") 
"Threshold:"
lda.thresh
cm.lda_thresh

acc_lda <- cm.lda_thresh[["overall"]][["Accuracy"]]*100
auc_lda <- lda_roc[["auc"]]
thresh_lda <- lda.thresh
sens_lda <-  cm.lda_thresh[["byClass"]][["Sensitivity"]]*100
spec_lda <- cm.lda_thresh[["byClass"]][["Specificity"]]*100
FDR_lda <- ((cm.lda_thresh[["table"]][2,1])/(cm.lda_thresh[["table"]][2,1]+cm.lda_thresh[["table"]][2,2]))*100
prec_lda <- cm.lda_thresh[["byClass"]][["Precision"]]*100

cost.lda <- cm.lda_thresh[["table"]][1]*0 + cm.lda_thresh[["table"]][2]*FP_C + cm.lda_thresh[["table"]][3]*FN_C + cm.lda_thresh[["table"]][4]*TP_C
cost.lda
#> [1] "Threshold:"
#> [1] 8e-04
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   NBT    BT
#>        NBT 56463   112
#>        BT   4756  1910
#>                                           
#>                Accuracy : 0.923           
#>                  95% CI : (0.9209, 0.9251)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.4108          
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 0.94461         
#>             Specificity : 0.92231         
#>          Pos Pred Value : 0.28653         
#>          Neg Pred Value : 0.99802         
#>              Prevalence : 0.03197         
#>          Detection Rate : 0.03020         
#>    Detection Prevalence : 0.10541         
#>       Balanced Accuracy : 0.93346         
#>                                           
#>        'Positive' Class : BT              
#>                                           
#> [1] -4988000

Sampling Variability

"10 Fold Results"
lda.fit$resample  
lda.sd <- sd(lda.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#>     Accuracy     Kappa parameter Resample
#> 1  0.9857685 0.7793824      none   Fold01
#> 2  0.9854522 0.7824666      none   Fold02
#> 3  0.9829222 0.7328063      none   Fold03
#> 4  0.9845059 0.7630316      none   Fold04
#> 5  0.9807084 0.7153126      none   Fold05
#> 6  0.9843454 0.7632954      none   Fold06
#> 7  0.9802340 0.6914626      none   Fold07
#> 8  0.9867173 0.7912156      none   Fold08
#> 9  0.9840291 0.7507018      none   Fold09
#> 10 0.9841872 0.7548660      none   Fold10

The average accuracy across ten folds is 92.3 with a standard deviation of 0.208.

QDA

Model

#pass
fitControl <- trainControl(method = "cv",
                           number = 10,
                           returnResamp = 'all',
                           savePredictions = 'final',
                           classProbs = TRUE)

set.seed(4)
tic("QDA")
qda.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
                    data = df_factor,
                    preProcess=c("center","scale"),
                    method="qda",
                    verbose= FALSE,
                    trControl= fitControl)
qda.time <- toc(quiet = TRUE)
qda.fit
#> Quadratic Discriminant Analysis 
#> 
#> 63241 samples
#>     3 predictor
#>     2 classes: 'NBT', 'BT' 
#> 
#> Pre-processing: centered (3), scaled (3) 
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ... 
#> Resampling results:
#> 
#>   Accuracy   Kappa    
#>   0.9945763  0.9051517

ROC

#pass
qda.prob <- predict(qda.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
qda_roc <- roc(df_factor $Blue_Tarp_or_Not, qda.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="QDA ROC Curve") 

Thresholds

roc.info_qda <- roc(df_factor$Blue_Tarp_or_Not, qda.prob[,2], legacy.axes=TRUE)
roc.qda.df <- data.frame(tpp=roc.info_qda$sensitivities*100, fpp=(1-roc.info_qda$specificities)*100, thresholds=roc.info_qda$thresholds)
#roc.qda.df[roc.qda.df>98 & roc.qda.df < 99,]

fig4 <- plot_ly(roc.qda.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig4 <- fig4 %>% add_markers()
fig4 <- fig4 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
                     yaxis = list(title = 'False Positive Rate'),
                     zaxis = list(title = 'Threshold')))

fig4

Confusion Matrix

qda.thresh <- 0.0016
qda.pred_thresh <- factor(ifelse(qda.prob[,2]>qda.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.qda_thresh <- confusionMatrix(factor(qda.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT") 
"Threshold:"
qda.thresh
cm.qda_thresh

acc_qda <- cm.qda_thresh[["overall"]][["Accuracy"]]*100
auc_qda <- qda_roc[["auc"]]
thresh_qda <- qda.thresh
sens_qda <-  cm.qda_thresh[["byClass"]][["Sensitivity"]]*100
spec_qda <- cm.qda_thresh[["byClass"]][["Specificity"]]*100
FDR_qda <- ((cm.qda_thresh[["table"]][2,1])/(cm.qda_thresh[["table"]][2,1]+cm.qda_thresh[["table"]][2,2]))*100
prec_qda <- cm.qda_thresh[["byClass"]][["Precision"]]*100

cost.qda <- cm.qda_thresh[["table"]][1]*0 + cm.qda_thresh[["table"]][2]*FP_C + cm.qda_thresh[["table"]][3]*FN_C + cm.qda_thresh[["table"]][4]*TP_C
cost.qda
#> [1] "Threshold:"
#> [1] 0.0016
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   NBT    BT
#>        NBT 55336     1
#>        BT   5883  2021
#>                                           
#>                Accuracy : 0.907           
#>                  95% CI : (0.9047, 0.9092)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.3754          
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 0.99951         
#>             Specificity : 0.90390         
#>          Pos Pred Value : 0.25569         
#>          Neg Pred Value : 0.99998         
#>              Prevalence : 0.03197         
#>          Detection Rate : 0.03196         
#>    Detection Prevalence : 0.12498         
#>       Balanced Accuracy : 0.95170         
#>                                           
#>        'Positive' Class : BT              
#>                                           
#> [1] -4968750

Sampling Variability

"10 Fold Results"
qda.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"... 
qda.sd <- sd(qda.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#>     Accuracy     Kappa parameter Resample
#> 1  0.9954143 0.9203183      none   Fold01
#> 2  0.9958887 0.9298247      none   Fold02
#> 3  0.9938330 0.8917279      none   Fold03
#> 4  0.9941502 0.8978148      none   Fold04
#> 5  0.9938330 0.8911622      none   Fold05
#> 6  0.9954143 0.9215293      none   Fold06
#> 7  0.9928843 0.8710492      none   Fold07
#> 8  0.9957306 0.9269411      none   Fold08
#> 9  0.9947818 0.9088589      none   Fold09
#> 10 0.9938330 0.8922908      none   Fold10

The average accuracy across ten folds is 90.7 with a standard deviation of 0.101.

KNN

Model

#pass
fitControl <- trainControl(method = "cv",
                           number = 10,
                           returnResamp = 'all',
                           savePredictions = 'final',
                           classProbs = TRUE)

set.seed(4)
tic("KNN")
knn.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
                    data = df_factor,
                    preProcess=c("center","scale"),
                    method="knn",
                    trControl= fitControl,
                    tuneLength=10
                    )
knn.time <- toc(quiet = TRUE)
knn.fit
#> k-Nearest Neighbors 
#> 
#> 63241 samples
#>     3 predictor
#>     2 classes: 'NBT', 'BT' 
#> 
#> Pre-processing: centered (3), scaled (3) 
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ... 
#> Resampling results across tuning parameters:
#> 
#>   k   Accuracy   Kappa    
#>    5  0.9972328  0.9555000
#>    7  0.9973435  0.9573825
#>    9  0.9972803  0.9563272
#>   11  0.9971221  0.9537932
#>   13  0.9972170  0.9552906
#>   15  0.9971379  0.9540331
#>   17  0.9970431  0.9524698
#>   19  0.9969956  0.9517428
#>   21  0.9969956  0.9516839
#>   23  0.9969482  0.9509483
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 7.
plot(knn.fit)

ROC

#pass
knn.prob <- predict(knn.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
knn_roc <- roc(df_factor $Blue_Tarp_or_Not, knn.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="KNN ROC Curve") 

Thresholds

Not sure why there are only 10 values for this one…?

roc.info_knn <- roc(df_factor$Blue_Tarp_or_Not, knn.prob[,2], legacy.axes=TRUE)
roc.knn.df <- data.frame(tpp=roc.info_knn$sensitivities*100, fpp=(1-roc.info_knn$specificities)*100, thresholds=roc.info_knn$thresholds)
#roc.knn.df[roc.knn.df>99 & roc.knn.df < 100,]
#roc.knn.df

fig5 <- plot_ly(roc.knn.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig5 <- fig5 %>% add_markers()
fig5 <- fig5 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
                     yaxis = list(title = 'False Positive Rate'),
                     zaxis = list(title = 'Threshold')))

fig5

Confusion Matrix

knn.thresh <- 0.035
knn.pred_thresh <- factor(ifelse(knn.prob[,2]>knn.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.knn_thresh <- confusionMatrix(factor(knn.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT") 
"Threshold:"
knn.thresh
cm.knn_thresh

acc_knn <- cm.knn_thresh[["overall"]][["Accuracy"]]*100
auc_knn <- knn_roc[["auc"]]
thresh_knn <- knn.thresh
sens_knn <-  cm.knn_thresh[["byClass"]][["Sensitivity"]]*100
spec_knn <- cm.knn_thresh[["byClass"]][["Specificity"]]*100
FDR_knn <- ((cm.knn_thresh[["table"]][2,1])/(cm.knn_thresh[["table"]][2,1]+cm.knn_thresh[["table"]][2,2]))*100
prec_knn <- cm.knn_thresh[["byClass"]][["Precision"]]*100
k_knn <- knn.fit[["bestTune"]][["k"]]

cost.knn <- cm.knn_thresh[["table"]][1]*0 + cm.knn_thresh[["table"]][2]*FP_C + cm.knn_thresh[["table"]][3]*FN_C + cm.knn_thresh[["table"]][4]*TP_C
cost.knn
#> [1] "Threshold:"
#> [1] 0.035
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   NBT    BT
#>        NBT 60685     0
#>        BT    534  2022
#>                                           
#>                Accuracy : 0.9916          
#>                  95% CI : (0.9908, 0.9923)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.879           
#>                                           
#>  Mcnemar's Test P-Value : < 2.2e-16       
#>                                           
#>             Sensitivity : 1.00000         
#>             Specificity : 0.99128         
#>          Pos Pred Value : 0.79108         
#>          Neg Pred Value : 1.00000         
#>              Prevalence : 0.03197         
#>          Detection Rate : 0.03197         
#>    Detection Prevalence : 0.04042         
#>       Balanced Accuracy : 0.99564         
#>                                           
#>        'Positive' Class : BT              
#>                                           
#> [1] -2289000

Sampling Variability

"10 Fold Results"
knn.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"... 
knn.sd <- sd(knn.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#>      Accuracy     Kappa  k Resample
#> 1   0.9977862 0.9638571  5   Fold01
#> 2   0.9974700 0.9586938  7   Fold01
#> 3   0.9974700 0.9586938  9   Fold01
#> 4   0.9976281 0.9613688 11   Fold01
#> 5   0.9977862 0.9638571 13   Fold01
#> 6   0.9977862 0.9638571 15   Fold01
#> 7   0.9976281 0.9611816 17   Fold01
#> 8   0.9976281 0.9611816 19   Fold01
#> 9   0.9976281 0.9611816 21   Fold01
#> 10  0.9976281 0.9611816 23   Fold01
#> 11  0.9979443 0.9668394  5   Fold02
#> 12  0.9973118 0.9570464  7   Fold02
#> 13  0.9974700 0.9594773  9   Fold02
#> 14  0.9974700 0.9594773 11   Fold02
#> 15  0.9973118 0.9568423 13   Fold02
#> 16  0.9971537 0.9544119 15   Fold02
#> 17  0.9971537 0.9544119 17   Fold02
#> 18  0.9973118 0.9568423 19   Fold02
#> 19  0.9973118 0.9566362 21   Fold02
#> 20  0.9973118 0.9566362 23   Fold02
#> 21  0.9974700 0.9598579  5   Fold03
#> 22  0.9968374 0.9500570  7   Fold03
#> 23  0.9968374 0.9495856  9   Fold03
#> 24  0.9973118 0.9570464 11   Fold03
#> 25  0.9974700 0.9596685 13   Fold03
#> 26  0.9974700 0.9596685 15   Fold03
#> 27  0.9971537 0.9546271 17   Fold03
#> 28  0.9974700 0.9596685 19   Fold03
#> 29  0.9976281 0.9620998 21   Fold03
#> 30  0.9971537 0.9548402 23   Fold03
#> 31  0.9965217 0.9437481  5   Fold04
#> 32  0.9971542 0.9544120  7   Fold04
#> 33  0.9968379 0.9493467  9   Fold04
#> 34  0.9965217 0.9442814 11   Fold04
#> 35  0.9969960 0.9517651 13   Fold04
#> 36  0.9966798 0.9466877 15   Fold04
#> 37  0.9966798 0.9466877 17   Fold04
#> 38  0.9965217 0.9442814 19   Fold04
#> 39  0.9963636 0.9418865 21   Fold04
#> 40  0.9962055 0.9392161 23   Fold04
#> 41  0.9976281 0.9613688  5   Fold05
#> 42  0.9977862 0.9642031  7   Fold05
#> 43  0.9976281 0.9613688  9   Fold05
#> 44  0.9973118 0.9562180 11   Fold05
#> 45  0.9974700 0.9588925 13   Fold05
#> 46  0.9974700 0.9588925 15   Fold05
#> 47  0.9974700 0.9588925 17   Fold05
#> 48  0.9973118 0.9562180 19   Fold05
#> 49  0.9974700 0.9588925 21   Fold05
#> 50  0.9974700 0.9588925 23   Fold05
#> 51  0.9974700 0.9588925  5   Fold06
#> 52  0.9979443 0.9666803  7   Fold06
#> 53  0.9974700 0.9590893  9   Fold06
#> 54  0.9973118 0.9562180 11   Fold06
#> 55  0.9971537 0.9537541 13   Fold06
#> 56  0.9974700 0.9588925 15   Fold06
#> 57  0.9973118 0.9564281 17   Fold06
#> 58  0.9973118 0.9564281 19   Fold06
#> 59  0.9974700 0.9588925 21   Fold06
#> 60  0.9973118 0.9564281 23   Fold06
#> 61  0.9971537 0.9537541  5   Fold07
#> 62  0.9977862 0.9638571  7   Fold07
#> 63  0.9977862 0.9640309  9   Fold07
#> 64  0.9974700 0.9588925 11   Fold07
#> 65  0.9976281 0.9615542 13   Fold07
#> 66  0.9971537 0.9537541 15   Fold07
#> 67  0.9973118 0.9564281 17   Fold07
#> 68  0.9971537 0.9537541 19   Fold07
#> 69  0.9973118 0.9564281 21   Fold07
#> 70  0.9973118 0.9564281 23   Fold07
#> 71  0.9976281 0.9622782  5   Fold08
#> 72  0.9976281 0.9622782  7   Fold08
#> 73  0.9974700 0.9598579  9   Fold08
#> 74  0.9973118 0.9574490 11   Fold08
#> 75  0.9974700 0.9598579 13   Fold08
#> 76  0.9974700 0.9598579 15   Fold08
#> 77  0.9973118 0.9570464 17   Fold08
#> 78  0.9968374 0.9498224 19   Fold08
#> 79  0.9968374 0.9495856 21   Fold08
#> 80  0.9969956 0.9519931 23   Fold08
#> 81  0.9966793 0.9466875  5   Fold09
#> 82  0.9969956 0.9515345  7   Fold09
#> 83  0.9969956 0.9517649  9   Fold09
#> 84  0.9965212 0.9442812 11   Fold09
#> 85  0.9965212 0.9440158 13   Fold09
#> 86  0.9963631 0.9416101 15   Fold09
#> 87  0.9962049 0.9389263 17   Fold09
#> 88  0.9960468 0.9365327 19   Fold09
#> 89  0.9958887 0.9335201 21   Fold09
#> 90  0.9958887 0.9335201 23   Fold09
#> 91  0.9960468 0.9377164  5   Fold10
#> 92  0.9965212 0.9450623  7   Fold10
#> 93  0.9968374 0.9500567  9   Fold10
#> 94  0.9963631 0.9426991 11   Fold10
#> 95  0.9963631 0.9426991 13   Fold10
#> 96  0.9963631 0.9426991 15   Fold10
#> 97  0.9962049 0.9400680 17   Fold10
#> 98  0.9963631 0.9426991 19   Fold10
#> 99  0.9960468 0.9377164 21   Fold10
#> 100 0.9962049 0.9403468 23   Fold10

The average accuracy across ten folds when k = 7 is 99.16 with a standard deviation of 0.052.

Random Forest

Model

#pass
fitControl <- trainControl(method = "cv",
                           number = 10,
                           returnResamp = 'all',
                           savePredictions = 'final',
                           classProbs = TRUE)

set.seed(4)
tic("RF")
rf.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
                    data = df_factor,
                    preProcess=c("center","scale"),
                    method="rf", #what is the difference between the different caret rf models??
                    trControl= fitControl,
                    tuneLength=3
                    )
rf.time <- toc(quiet = TRUE)
rf.fit
#> note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
#> 
#> Random Forest 
#> 
#> 63241 samples
#>     3 predictor
#>     2 classes: 'NBT', 'BT' 
#> 
#> Pre-processing: centered (3), scaled (3) 
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>   2     0.9969798  0.9509141
#>   3     0.9969482  0.9504008
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
plot(rf.fit)

ROC

#pass
RF.prob <- predict(rf.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
RF_roc <- roc(df_factor $Blue_Tarp_or_Not, RF.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="RF ROC Curve") 

Thresholds

roc.info_rf <- roc(df_factor$Blue_Tarp_or_Not, RF.prob[,2], legacy.axes=TRUE)
roc.rf.df <- data.frame(tpp=roc.info_rf$sensitivities*100, fpp=(1-roc.info_rf$specificities)*100, thresholds=roc.info_rf$thresholds)
#roc.rf.df[roc.rf.df>99 & roc.rf.df < 100,]
#roc.rf.df

fig6 <- plot_ly(roc.rf.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig6 <- fig6 %>% add_markers()
fig6 <- fig6 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
                     yaxis = list(title = 'False Positive Rate'),
                     zaxis = list(title = 'Threshold')))

fig6

Confusion Matrix

RF.thresh <- 0.45
RF.pred_thresh <- factor(ifelse(RF.prob[,2]>RF.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.RF_thresh <- confusionMatrix(factor(RF.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT") 
"Threshold:"
RF.thresh
cm.RF_thresh

acc_RF <- cm.RF_thresh[["overall"]][["Accuracy"]]*100
auc_RF <- RF_roc[["auc"]]
thresh_RF <- RF.thresh
sens_RF <-  cm.RF_thresh[["byClass"]][["Sensitivity"]]*100
spec_RF <- cm.RF_thresh[["byClass"]][["Specificity"]]*100
FDR_RF <- ((cm.RF_thresh[["table"]][2,1])/(cm.RF_thresh[["table"]][2,1]+cm.RF_thresh[["table"]][2,2]))*100
prec_RF <- cm.RF_thresh[["byClass"]][["Precision"]]*100
mtry_best <- rf.fit[["bestTune"]][["mtry"]]
ntree <- rf.fit[["finalModel"]][["ntree"]]

cost.RF <- cm.RF_thresh[["table"]][1]*0 + cm.RF_thresh[["table"]][2]*FP_C + cm.RF_thresh[["table"]][3]*FN_C + cm.RF_thresh[["table"]][4]*TP_C
cost.RF
#> [1] "Threshold:"
#> [1] 0.45
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   NBT    BT
#>        NBT 61214    22
#>        BT      5  2000
#>                                           
#>                Accuracy : 0.9996          
#>                  95% CI : (0.9994, 0.9997)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.9931          
#>                                           
#>  Mcnemar's Test P-Value : 0.002076        
#>                                           
#>             Sensitivity : 0.98912         
#>             Specificity : 0.99992         
#>          Pos Pred Value : 0.99751         
#>          Neg Pred Value : 0.99964         
#>              Prevalence : 0.03197         
#>          Detection Rate : 0.03163         
#>    Detection Prevalence : 0.03170         
#>       Balanced Accuracy : 0.99452         
#>                                           
#>        'Positive' Class : BT              
#>                                           
#> [1] -2140000

Sampling Variability

"10 Fold Results"
rf.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"... 
rf.sd <- sd(rf.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#>     Accuracy     Kappa mtry Resample
#> 1  0.9971537 0.9528468    2   Fold01
#> 2  0.9973118 0.9555754    3   Fold01
#> 3  0.9973118 0.9560058    2   Fold02
#> 4  0.9974700 0.9588925    3   Fold02
#> 5  0.9963631 0.9416101    2   Fold03
#> 6  0.9965212 0.9442812    3   Fold03
#> 7  0.9966798 0.9453903    2   Fold04
#> 8  0.9968379 0.9476083    3   Fold04
#> 9  0.9971537 0.9533048    2   Fold05
#> 10 0.9963631 0.9401887    3   Fold05
#> 11 0.9977862 0.9643737    2   Fold06
#> 12 0.9976281 0.9617378    3   Fold06
#> 13 0.9969956 0.9505907    2   Fold07
#> 14 0.9971537 0.9530769    3   Fold07
#> 15 0.9971537 0.9537541    2   Fold08
#> 16 0.9973118 0.9564281    3   Fold08
#> 17 0.9968374 0.9491052    2   Fold09
#> 18 0.9966793 0.9464329    3   Fold09
#> 19 0.9963631 0.9421597    2   Fold10
#> 20 0.9962049 0.9397866    3   Fold10

The average accuracy across ten folds when mtry = 2 is 99.96 with a standard deviation of 0.045.

SVM

Model

#pass
fitControl <- trainControl(method = "cv",
                           number = 10,
                           returnResamp = 'all',
                           savePredictions = 'final',
                           classProbs = TRUE)

set.seed(4)
tic("SVM")
svm.radial.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
                    data = df_factor,
                    preProcess=c("center","scale"),
                    method="svmRadial",
                    trControl= fitControl,
                    tuneLength=10
                    #tuneGrid = expand.grid(C=seq(0,10, length=10),
                    #                           sigma =seq(0,10, length=10))
                    )
svm.time <- toc(quiet = TRUE)
svm.radial.fit
"Summary"
summary(svm.radial.fit)
#> Support Vector Machines with Radial Basis Function Kernel 
#> 
#> 63241 samples
#>     3 predictor
#>     2 classes: 'NBT', 'BT' 
#> 
#> Pre-processing: centered (3), scaled (3) 
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 56917, 56917, 56917, 56916, 56917, 56917, ... 
#> Resampling results across tuning parameters:
#> 
#>   C       Accuracy   Kappa    
#>     0.25  0.9967268  0.9463658
#>     0.50  0.9968849  0.9489059
#>     1.00  0.9969798  0.9505083
#>     2.00  0.9970589  0.9518827
#>     4.00  0.9971063  0.9526307
#>     8.00  0.9971854  0.9538882
#>    16.00  0.9972170  0.9545384
#>    32.00  0.9973909  0.9575077
#>    64.00  0.9973751  0.9574081
#>   128.00  0.9974384  0.9584802
#> 
#> Tuning parameter 'sigma' was held constant at a value of 9.020118
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were sigma = 9.020118 and C = 128.
#> [1] "Summary"
#> Length  Class   Mode 
#>      1   ksvm     S4
plot(svm.radial.fit)

Both linear and poly SVM functions were considered. Radial SVM produced the highest accuracy values of the three methods. SVM radial was chosen for building the SVM model.

ROC

#pass
SVM.prob <- predict(svm.radial.fit, newdata=df_factor , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
SVM_roc <- roc(df_factor $Blue_Tarp_or_Not, SVM.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#0072B2", lwd=4, print.auc=TRUE, main="SVM ROC Curve") 

Thresholds

roc.info_svm <- roc(df_factor$Blue_Tarp_or_Not, SVM.prob[,2], legacy.axes=TRUE)
roc.svm.df <- data.frame(tpp=roc.info_svm$sensitivities*100, fpp=(1-roc.info_svm$specificities)*100, thresholds=roc.info_svm$thresholds)
#roc.svm.df[roc.svm.df>99 & roc.svm.df < 100,]
#roc.svm.df

fig7 <- plot_ly(roc.svm.df, x=~tpp, y=~fpp, z=~thresholds, colors = "#0072B2") #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig7 <- fig7 %>% add_markers()
fig7 <- fig7 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
                     yaxis = list(title = 'False Positive Rate'),
                     zaxis = list(title = 'Threshold')))

fig7

Confusion Matrix

SVM.thresh <- 0.0017
SVM.pred_thresh <- factor(ifelse(SVM.prob[,2]>SVM.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.SVM_thresh <- confusionMatrix(factor(SVM.pred_thresh),df_factor $Blue_Tarp_or_Not, positive = "BT") 
"Threshold:"
SVM.thresh
cm.SVM_thresh

acc_SVM <- cm.SVM_thresh[["overall"]][["Accuracy"]]*100
auc_SVM <- SVM_roc[["auc"]]
thresh_SVM <- SVM.thresh
sens_SVM <-  cm.SVM_thresh[["byClass"]][["Sensitivity"]]*100
spec_SVM <- cm.SVM_thresh[["byClass"]][["Specificity"]]*100
FDR_SVM <- ((cm.SVM_thresh[["table"]][2,1])/(cm.SVM_thresh[["table"]][2,1]+cm.SVM_thresh[["table"]][2,2]))*100
prec_SVM <- cm.SVM_thresh[["byClass"]][["Precision"]]*100
sigma_best <- round(svm.radial.fit[["bestTune"]][["sigma"]], 4)
C_best <- svm.radial.fit[["bestTune"]][["C"]]

cost.SVM <- cm.SVM_thresh[["table"]][1]*0 + cm.SVM_thresh[["table"]][2]*FP_C + cm.SVM_thresh[["table"]][3]*FN_C + cm.SVM_thresh[["table"]][4]*TP_C
cost.SVM
#> [1] "Threshold:"
#> [1] 0.0017
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   NBT    BT
#>        NBT 59710     4
#>        BT   1509  2018
#>                                           
#>                Accuracy : 0.9761          
#>                  95% CI : (0.9749, 0.9773)
#>     No Information Rate : 0.968           
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.7158          
#>                                           
#>  Mcnemar's Test P-Value : < 2.2e-16       
#>                                           
#>             Sensitivity : 0.99802         
#>             Specificity : 0.97535         
#>          Pos Pred Value : 0.57216         
#>          Neg Pred Value : 0.99993         
#>              Prevalence : 0.03197         
#>          Detection Rate : 0.03191         
#>    Detection Prevalence : 0.05577         
#>       Balanced Accuracy : 0.98669         
#>                                           
#>        'Positive' Class : BT              
#>                                           
#> [1] -2797500

Sampling Variability

"10 Fold Results"
svm.radial.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"... 
svm.sd <- sd(svm.radial.fit[["resample"]][["Accuracy"]]*100)
#> [1] "10 Fold Results"
#>      Accuracy     Kappa    sigma      C Resample
#> 1   0.9966793 0.9448527 9.020118   0.25   Fold01
#> 2   0.9968374 0.9476075 9.020118   0.50   Fold01
#> 3   0.9973118 0.9555754 9.020118   1.00   Fold01
#> 4   0.9973118 0.9555754 9.020118   2.00   Fold01
#> 5   0.9974700 0.9580860 9.020118   4.00   Fold01
#> 6   0.9974700 0.9580860 9.020118   8.00   Fold01
#> 7   0.9974700 0.9580860 9.020118  16.00   Fold01
#> 8   0.9974700 0.9580860 9.020118  32.00   Fold01
#> 9   0.9974700 0.9582906 9.020118  64.00   Fold01
#> 10  0.9974700 0.9582906 9.020118 128.00   Fold01
#> 11  0.9966793 0.9456543 9.020118   0.25   Fold02
#> 12  0.9971537 0.9535305 9.020118   0.50   Fold02
#> 13  0.9973118 0.9560058 9.020118   1.00   Fold02
#> 14  0.9973118 0.9562180 9.020118   2.00   Fold02
#> 15  0.9971537 0.9535305 9.020118   4.00   Fold02
#> 16  0.9974700 0.9586938 9.020118   8.00   Fold02
#> 17  0.9974700 0.9586938 9.020118  16.00   Fold02
#> 18  0.9977862 0.9640309 9.020118  32.00   Fold02
#> 19  0.9977862 0.9643737 9.020118  64.00   Fold02
#> 20  0.9976281 0.9619196 9.020118 128.00   Fold02
#> 21  0.9966793 0.9453897 9.020118   0.25   Fold03
#> 22  0.9971537 0.9535305 9.020118   0.50   Fold03
#> 23  0.9974700 0.9588925 9.020118   1.00   Fold03
#> 24  0.9974700 0.9590893 9.020118   2.00   Fold03
#> 25  0.9974700 0.9590893 9.020118   4.00   Fold03
#> 26  0.9976281 0.9615542 9.020118   8.00   Fold03
#> 27  0.9973118 0.9566362 9.020118  16.00   Fold03
#> 28  0.9971537 0.9544119 9.020118  32.00   Fold03
#> 29  0.9969956 0.9517649 9.020118  64.00   Fold03
#> 30  0.9969956 0.9517649 9.020118 128.00   Fold03
#> 31  0.9963636 0.9410501 9.020118   0.25   Fold04
#> 32  0.9960474 0.9356152 9.020118   0.50   Fold04
#> 33  0.9963636 0.9407660 9.020118   1.00   Fold04
#> 34  0.9963636 0.9407660 9.020118   2.00   Fold04
#> 35  0.9963636 0.9407660 9.020118   4.00   Fold04
#> 36  0.9962055 0.9383392 9.020118   8.00   Fold04
#> 37  0.9965217 0.9437481 9.020118  16.00   Fold04
#> 38  0.9969960 0.9515348 9.020118  32.00   Fold04
#> 39  0.9973123 0.9568424 9.020118  64.00   Fold04
#> 40  0.9974704 0.9594774 9.020118 128.00   Fold04
#> 41  0.9971537 0.9530769 9.020118   0.25   Fold05
#> 42  0.9974700 0.9580860 9.020118   0.50   Fold05
#> 43  0.9971537 0.9530769 9.020118   1.00   Fold05
#> 44  0.9973118 0.9555754 9.020118   2.00   Fold05
#> 45  0.9973118 0.9553569 9.020118   4.00   Fold05
#> 46  0.9973118 0.9553569 9.020118   8.00   Fold05
#> 47  0.9974700 0.9580860 9.020118  16.00   Fold05
#> 48  0.9974700 0.9580860 9.020118  32.00   Fold05
#> 49  0.9974700 0.9582906 9.020118  64.00   Fold05
#> 50  0.9976281 0.9609926 9.020118 128.00   Fold05
#> 51  0.9966793 0.9448527 9.020118   0.25   Fold06
#> 52  0.9966793 0.9448527 9.020118   0.50   Fold06
#> 53  0.9968374 0.9478633 9.020118   1.00   Fold06
#> 54  0.9968374 0.9478633 9.020118   2.00   Fold06
#> 55  0.9971537 0.9530769 9.020118   4.00   Fold06
#> 56  0.9973118 0.9557916 9.020118   8.00   Fold06
#> 57  0.9973118 0.9557916 9.020118  16.00   Fold06
#> 58  0.9977862 0.9638571 9.020118  32.00   Fold06
#> 59  0.9976281 0.9611816 9.020118  64.00   Fold06
#> 60  0.9977862 0.9640309 9.020118 128.00   Fold06
#> 61  0.9969956 0.9505907 9.020118   0.25   Fold07
#> 62  0.9969956 0.9505907 9.020118   0.50   Fold07
#> 63  0.9968374 0.9476075 9.020118   1.00   Fold07
#> 64  0.9971537 0.9530769 9.020118   2.00   Fold07
#> 65  0.9971537 0.9530769 9.020118   4.00   Fold07
#> 66  0.9969956 0.9501048 9.020118   8.00   Fold07
#> 67  0.9973118 0.9555754 9.020118  16.00   Fold07
#> 68  0.9974700 0.9582906 9.020118  32.00   Fold07
#> 69  0.9974700 0.9584932 9.020118  64.00   Fold07
#> 70  0.9974700 0.9582906 9.020118 128.00   Fold07
#> 71  0.9968374 0.9486156 9.020118   0.25   Fold08
#> 72  0.9971537 0.9535305 9.020118   0.50   Fold08
#> 73  0.9974700 0.9586938 9.020118   1.00   Fold08
#> 74  0.9976281 0.9613688 9.020118   2.00   Fold08
#> 75  0.9977862 0.9640309 9.020118   4.00   Fold08
#> 76  0.9977862 0.9640309 9.020118   8.00   Fold08
#> 77  0.9977862 0.9640309 9.020118  16.00   Fold08
#> 78  0.9981025 0.9693170 9.020118  32.00   Fold08
#> 79  0.9977862 0.9643737 9.020118  64.00   Fold08
#> 80  0.9977862 0.9642031 9.020118 128.00   Fold08
#> 81  0.9962049 0.9380407 9.020118   0.25   Fold09
#> 82  0.9962049 0.9377398 9.020118   0.50   Fold09
#> 83  0.9962049 0.9377398 9.020118   1.00   Fold09
#> 84  0.9963631 0.9401887 9.020118   2.00   Fold09
#> 85  0.9963631 0.9401887 9.020118   4.00   Fold09
#> 86  0.9966793 0.9453897 9.020118   8.00   Fold09
#> 87  0.9966793 0.9453897 9.020118  16.00   Fold09
#> 88  0.9968374 0.9481165 9.020118  32.00   Fold09
#> 89  0.9971537 0.9535305 9.020118  64.00   Fold09
#> 90  0.9974700 0.9588925 9.020118 128.00   Fold09
#> 91  0.9969956 0.9515345 9.020118   0.25   Fold10
#> 92  0.9971537 0.9539755 9.020118   0.50   Fold10
#> 93  0.9968374 0.9488616 9.020118   1.00   Fold10
#> 94  0.9968374 0.9491052 9.020118   2.00   Fold10
#> 95  0.9968374 0.9491052 9.020118   4.00   Fold10
#> 96  0.9969956 0.9515345 9.020118   8.00   Fold10
#> 97  0.9968374 0.9493464 9.020118  16.00   Fold10
#> 98  0.9968374 0.9493464 9.020118  32.00   Fold10
#> 99  0.9966793 0.9469395 9.020118  64.00   Fold10
#> 100 0.9966793 0.9469395 9.020118 128.00   Fold10

The average accuracy across ten folds is 97.61 with a standard deviation of 0.045.

K-Folds Out of Sampling Performance

Table 2 - Performance Summary

Performance Metrics: 10-Fold Cross-Validation Metrics

Method KNN (k = 7) LDA QDA Log. Regression Random Forest (mtry = 2, nTrees = 500) SVM (sigma = 9.0201, C = 128)
Accuracy 99.16% 92.3% 90.7% 90.71% 99.96% 97.61%
AUC 99.98% 98.89% 99.82% 99.85% 99.45% 99.92%
ROC
Threshold 0.04 0 0 0 0.45 0
Sensitivity 100% 94.46% 99.95% 99.8% 98.91% 99.8%
Specificity 99.13% 92.23% 90.39% 90.41% 99.99% 97.54%
FDR 20.89% 71.35% 74.43% 74.43% 0.25% 42.78%
Precision 79.11% 28.65% 25.57% 25.57% 99.75% 57.22%

Model performance based on running training data through the fitted model. Make some comments here about variation between folds

Model Performance

Hold-Out Test Sample Performance

df_FHO <- tibble(read.csv("SYS6018_FHO_Data.csv"))
df_FHO$Blue_Tarp_or_Not <- factor(df_FHO$Blue_Tarp_or_Not, labels = c("NBT", "BT"))
df_FHO
#> # A tibble: 2,004,177 x 4
#>      Red Green  Blue Blue_Tarp_or_Not
#>    <int> <int> <int> <fct>           
#>  1   104    89    63 NBT             
#>  2   101    80    60 NBT             
#>  3   103    87    69 NBT             
#>  4   107    93    72 NBT             
#>  5   109    99    68 NBT             
#>  6   103    73    53 NBT             
#>  7   100    79    56 NBT             
#>  8    98    70    51 NBT             
#>  9    97    73    56 NBT             
#> 10    99    79    61 NBT             
#> # ... with 2,004,167 more rows

Logistic Regression

tic("Log.Reg. Predict")
glm.prob_FHO <- predict(glm.fit, newdata=df_FHO , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
glm.predict.time <- toc(quiet = TRUE)

lr.thresh_FHO <- lr.thresh
glm.pred_thresh_FHO <- factor(ifelse(glm.prob_FHO[,2]>lr.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.glm_thresh_FHO <- confusionMatrix(factor(glm.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT") 
"Threshold:" 
cm.glm_thresh_FHO

acc_LR_FHO <- round(cm.glm_thresh_FHO[["overall"]][["Accuracy"]]*100, 2)
auc_LR_FHO <- round(glm_roc[["auc"]], 2)
thresh_LR_FHO <- lr.thresh_FHO
sens_LR_FHO <-  round(cm.glm_thresh_FHO[["byClass"]][["Sensitivity"]]*100, 2)
spec_LR_FHO <- round(cm.glm_thresh_FHO[["byClass"]][["Specificity"]]*100, 2)
FDR_LR_FHO <- round(((cm.glm_thresh_FHO[["table"]][2,1])/(cm.glm_thresh_FHO[["table"]][2,1]+cm.glm_thresh_FHO[["table"]][2,2]))*100, 2)
prec_LR_FHO <- round(cm.glm_thresh_FHO[["byClass"]][["Precision"]]*100, 2)
#> [1] "Threshold:"
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     NBT      BT
#>        NBT 1617298       0
#>        BT   372399   14480
#>                                           
#>                Accuracy : 0.8142          
#>                  95% CI : (0.8136, 0.8147)
#>     No Information Rate : 0.9928          
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.059           
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 1.000000        
#>             Specificity : 0.812836        
#>          Pos Pred Value : 0.037428        
#>          Neg Pred Value : 1.000000        
#>              Prevalence : 0.007225        
#>          Detection Rate : 0.007225        
#>    Detection Prevalence : 0.193036        
#>       Balanced Accuracy : 0.906418        
#>                                           
#>        'Positive' Class : BT              
#> 

LDA

tic("LDA Predict")
lda.prob_FHO <- predict(lda.fit, newdata=df_FHO, type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
lda.predict.time <- toc(quiet = TRUE)

lda.thresh_FHO <- lda.thresh
lda.pred_thresh_FHO <- factor(ifelse(lda.prob_FHO[,2]>lda.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.lda_thresh_FHO <- confusionMatrix(factor(lda.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT") 
"Threshold:" 
cm.lda_thresh_FHO

acc_lda_FHO <- round(cm.lda_thresh_FHO[["overall"]][["Accuracy"]]*100, 2)
auc_lda_FHO <- round(lda_roc[["auc"]],2)
thresh_lda_FHO <- lda.thresh_FHO
sens_lda_FHO <-  round(cm.lda_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_lda_FHO <- round(cm.lda_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_lda_FHO <- round(((cm.lda_thresh_FHO[["table"]][2,1])/(cm.lda_thresh_FHO[["table"]][2,1]+cm.lda_thresh_FHO[["table"]][2,2]))*100,2)
prec_lda_FHO <- round(cm.lda_thresh_FHO[["byClass"]][["Precision"]]*100, 2)
#> [1] "Threshold:"
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     NBT      BT
#>        NBT 1891276      83
#>        BT    98421   14397
#>                                           
#>                Accuracy : 0.9509          
#>                  95% CI : (0.9506, 0.9511)
#>     No Information Rate : 0.9928          
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.2162          
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 0.994268        
#>             Specificity : 0.950535        
#>          Pos Pred Value : 0.127613        
#>          Neg Pred Value : 0.999956        
#>              Prevalence : 0.007225        
#>          Detection Rate : 0.007183        
#>    Detection Prevalence : 0.056291        
#>       Balanced Accuracy : 0.972401        
#>                                           
#>        'Positive' Class : BT              
#> 

QDA

tic("QDA Predict")
qda.prob_FHO <- predict(qda.fit, newdata=df_FHO , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
qda.predict.time <- toc(quiet = TRUE)

qda.thresh_FHO <- qda.thresh
qda.pred_thresh_FHO <- factor(ifelse(qda.prob_FHO[,2]>qda.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.qda_thresh_FHO <- confusionMatrix(factor(qda.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT") 
"Threshold:" 
cm.qda_thresh_FHO

acc_qda_FHO <- round(cm.qda_thresh_FHO[["overall"]][["Accuracy"]]*100,2)
auc_qda_FHO <- round(qda_roc[["auc"]],2)
thresh_qda_FHO <- qda.thresh_FHO
sens_qda_FHO <-  round(cm.qda_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_qda_FHO <- round(cm.qda_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_qda_FHO <- round(((cm.qda_thresh_FHO[["table"]][2,1])/(cm.qda_thresh_FHO[["table"]][2,1]+cm.qda_thresh_FHO[["table"]][2,2]))*100,2)
prec_qda_FHO <- round(cm.qda_thresh_FHO[["byClass"]][["Precision"]]*100,2)
#> [1] "Threshold:"
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     NBT      BT
#>        NBT 1750542      87
#>        BT   239155   14393
#>                                           
#>                Accuracy : 0.8806          
#>                  95% CI : (0.8802, 0.8811)
#>     No Information Rate : 0.9928          
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.095           
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 0.993992        
#>             Specificity : 0.879803        
#>          Pos Pred Value : 0.056766        
#>          Neg Pred Value : 0.999950        
#>              Prevalence : 0.007225        
#>          Detection Rate : 0.007182        
#>    Detection Prevalence : 0.126510        
#>       Balanced Accuracy : 0.936898        
#>                                           
#>        'Positive' Class : BT              
#> 

KNN

tic("KNN Predict")
knn.prob_FHO <- predict(knn.fit, newdata=df_FHO , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
knn.predict.time <- toc(quiet = TRUE)

knn.thresh_FHO <- knn.thresh
knn.pred_thresh_FHO <- factor(ifelse(knn.prob_FHO[,2]>knn.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.knn_thresh_FHO <- confusionMatrix(factor(knn.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT") 
"Threshold:" 
cm.knn_thresh_FHO

acc_knn_FHO <- round(cm.knn_thresh_FHO[["overall"]][["Accuracy"]]*100,2)
auc_knn_FHO <- round(knn_roc[["auc"]],2)
thresh_knn_FHO <- knn.thresh_FHO
sens_knn_FHO <-  round(cm.knn_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_knn_FHO <- round(cm.knn_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_knn_FHO <- round(((cm.knn_thresh_FHO[["table"]][2,1])/(cm.knn_thresh_FHO[["table"]][2,1]+cm.knn_thresh_FHO[["table"]][2,2]))*100,2)
prec_knn_FHO <- round(cm.knn_thresh_FHO[["byClass"]][["Precision"]]*100,2)
#> [1] "Threshold:"
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     NBT      BT
#>        NBT 1941165     987
#>        BT    48532   13493
#>                                           
#>                Accuracy : 0.9753          
#>                  95% CI : (0.9751, 0.9755)
#>     No Information Rate : 0.9928          
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.3451          
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 0.931837        
#>             Specificity : 0.975608        
#>          Pos Pred Value : 0.217541        
#>          Neg Pred Value : 0.999492        
#>              Prevalence : 0.007225        
#>          Detection Rate : 0.006732        
#>    Detection Prevalence : 0.030948        
#>       Balanced Accuracy : 0.953723        
#>                                           
#>        'Positive' Class : BT              
#> 

RF

tic("RF Predict")
RF.prob_FHO <- predict(rf.fit, newdata=df_FHO , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
rf.predict.time <- toc(quiet = TRUE)

RF.thresh_FHO <- RF.thresh
RF.pred_thresh_FHO <- factor(ifelse(RF.prob_FHO[,2]>RF.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.RF_thresh_FHO <- confusionMatrix(factor(RF.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT") 
"Threshold:"
cm.RF_thresh_FHO

acc_RF_FHO <- round(cm.RF_thresh_FHO[["overall"]][["Accuracy"]]*100,2)
auc_RF_FHO <- round(RF_roc[["auc"]],2)
thresh_RF_FHO <- RF.thresh
sens_RF_FHO <-  round(cm.RF_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_RF_FHO <- round(cm.RF_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_RF_FHO <- round(((cm.RF_thresh_FHO[["table"]][2,1])/(cm.RF_thresh_FHO[["table"]][2,1]+cm.RF_thresh_FHO[["table"]][2,2]))*100,2)
prec_RF_FHO <- round(cm.RF_thresh_FHO[["byClass"]][["Precision"]]*100,2)
#> [1] "Threshold:"
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     NBT      BT
#>        NBT 1978213    2930
#>        BT    11484   11550
#>                                           
#>                Accuracy : 0.9928          
#>                  95% CI : (0.9927, 0.9929)
#>     No Information Rate : 0.9928          
#>     P-Value [Acc > NIR] : 0.2928          
#>                                           
#>                   Kappa : 0.6123          
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 0.797652        
#>             Specificity : 0.994228        
#>          Pos Pred Value : 0.501433        
#>          Neg Pred Value : 0.998521        
#>              Prevalence : 0.007225        
#>          Detection Rate : 0.005763        
#>    Detection Prevalence : 0.011493        
#>       Balanced Accuracy : 0.895940        
#>                                           
#>        'Positive' Class : BT              
#> 

SVM

tic("SVM Predict")
SVM.prob_FHO <- predict(svm.radial.fit, newdata=df_FHO , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
svm.predict.time <-  toc(quiet = TRUE)

SVM.thresh_FHO <- SVM.thresh
SVM.pred_thresh_FHO <- factor(ifelse(SVM.prob_FHO[,2]>SVM.thresh_FHO,"BT", "NBT"), levels=c("NBT", "BT"))
cm.SVM_thresh_FHO <- confusionMatrix(factor(SVM.pred_thresh_FHO),df_FHO$Blue_Tarp_or_Not, positive = "BT") 
"Threshold:"
cm.SVM_thresh_FHO

acc_SVM_FHO <- round(cm.SVM_thresh_FHO[["overall"]][["Accuracy"]]*100,2)
auc_SVM_FHO <- round(SVM_roc[["auc"]],2)
thresh_SVM_FHO <- SVM.thresh
sens_SVM_FHO <-  round(cm.SVM_thresh_FHO[["byClass"]][["Sensitivity"]]*100,2)
spec_SVM_FHO <- round(cm.SVM_thresh_FHO[["byClass"]][["Specificity"]]*100,2)
FDR_SVM_FHO <- round(((cm.SVM_thresh_FHO[["table"]][2,1])/(cm.SVM_thresh_FHO[["table"]][2,1]+cm.SVM_thresh_FHO[["table"]][2,2]))*100,2)
prec_SVM_FHO <- round(cm.SVM_thresh_FHO[["byClass"]][["Precision"]]*100,2)
#> [1] "Threshold:"
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     NBT      BT
#>        NBT 1915164    2172
#>        BT    74533   12308
#>                                          
#>                Accuracy : 0.9617         
#>                  95% CI : (0.9615, 0.962)
#>     No Information Rate : 0.9928         
#>     P-Value [Acc > NIR] : 1              
#>                                          
#>                   Kappa : 0.2335         
#>                                          
#>  Mcnemar's Test P-Value : <2e-16         
#>                                          
#>             Sensitivity : 0.850000       
#>             Specificity : 0.962541       
#>          Pos Pred Value : 0.141730       
#>          Neg Pred Value : 0.998867       
#>              Prevalence : 0.007225       
#>          Detection Rate : 0.006141       
#>    Detection Prevalence : 0.043330       
#>       Balanced Accuracy : 0.906270       
#>                                          
#>        'Positive' Class : BT             
#> 

Table 3 - FHO Performance Summary

Performance Metrics: Hold-Out Test Data Set Scores

Method KNN (k = 7) LDA QDA Log. Regression Random Forest (mtry = 2, nTrees = 500) SVM (sigma = 9.0201, C = 128)
Accuracy 97.53% 95.09% 88.06% 81.42% 99.28% 96.17%
AUC 99.98% 98.89% 99.82% 99.85% 99.45% 99.92%
ROC
Threshold 0.035 810^{-4} 0.0016 0.003 0.45 0.0017
Sensitivity=Recall=Power 93.18% 99.43% 99.4% 100% 79.77% 85%
Specificity=1-FPR 97.56% 95.05% 87.98% 81.28% 99.42% 96.25%
FDR 78.25% 87.24% 94.32% 96.26% 49.86% 85.83%
Precision=PPV 21.75% 12.76% 5.68% 3.74% 50.14% 14.17%
Cost

Using the performance results for each model on the final hold out data set, the predicted cost …

Discussion

mfrt.glm <- round((glm.time$toc-glm.time$tic), 2) 
mfrt.lda <- round((lda.time$toc-lda.time$tic), 2)
mfrt.qda <- round((qda.time$toc-qda.time$tic), 2) 
mfrt.knn <- round((knn.time$toc-knn.time$tic), 2) 
mfrt.rf <- round((rf.time$toc-rf.time$tic), 2)
mfrt.svm <- round((svm.time$toc-svm.time$tic), 2)
Model Run Time (s)
Log. Reg. 11.13
LDA 1.71
QDA 1.5
KNN 40.87
RF 68.94
SVM 176.14
mprt.glm <- round((glm.predict.time$toc-glm.predict.time$tic), 2) 
mprt.lda <- round((lda.predict.time$toc-lda.predict.time$tic),2)
mprt.qda <- round((qda.predict.time$toc-qda.predict.time$tic) ,2)
mprt.knn <- round((knn.predict.time$toc-knn.predict.time$tic) ,2)
mprt.rf <- round((rf.predict.time$toc-rf.predict.time$tic),2)
mprt.svm <- round((svm.predict.time$toc-svm.predict.time$tic),2)
Model Run Time (s)
Log. Reg. 5.96
LDA 7.06
QDA 6.48
KNN 367.61
RF 17.62
SVM 116.96

Run time varaition between devices, qualitative measure only.

(discussion on FHO data why we do this… what the benefits are… potential pitfalls)

(discussion somewhere about ROC curves AUC and… other metrics)

handling imbalanced classes

Conclusions

Future Work

#consider if I was able to find an additional data source like lidar or infrared to pair with this dataset to improve model performance... ? 

References and Works Cited

Appendix A: Analysis Methods Reference Info

LDA QDA
Assumptions this is a lot of text what happens when you put this much text in this table
Tuning Parameters

SVM - uses maximal margin classifier (requires classes be able to be separated by a linear decision boundary) hyperplane is selected to best separate two classes. In p-dimension space, the hyperplane is of dimension p-1. Hyperplane (2D) is defined by: β0 + β1X1 + β2X2 = 0 if value of hyperplane is > or < 0 then it is on either side of the boundary. Support vectors are used to construct the margin. Support vector classifier (soft margin classifier) is used for the non-separable case.

QUESTIONS FOR OH * the auc for the FHO data should be the same as the auc for the training data right?

#> Knit Time: 862.61 sec elapsed